Kaggle Shelter Animal Outcomes

https://www.kaggle.com/c/shelter-animal-outcomes

The data comes from Austin Animal Center from October 1st, 2013 to March, 2016. Outcomes represent the status of animals as they leave the Animal Center. All animals receive a unique Animal ID during intake.

In this competition, you are going to predict the outcome of the animal as they leave the Animal Center. These outcomes include: Adoption, Died, Euthanasia, Return to owner, and Transfer.

The train and test data are randomly split.

Data analysis

Import common packages


In [82]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

Load train and test datasets


In [83]:
train = pd.read_csv('train.csv.gz', parse_dates=['DateTime'], index_col='AnimalID')
test = pd.read_csv('test.csv.gz', parse_dates=['DateTime'], index_col='ID')

Lets take a look on train and test datasets


In [84]:
train.head()


Out[84]:
Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome AgeuponOutcome Breed Color
AnimalID
A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog Neutered Male 1 year Shetland Sheepdog Mix Brown/White
A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female 1 year Domestic Shorthair Mix Cream Tabby
A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male 2 years Pit Bull Mix Blue/White
A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat Intact Male 3 weeks Domestic Shorthair Mix Blue Cream
A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog Neutered Male 2 years Lhasa Apso/Miniature Poodle Tan

In [85]:
test.head()


Out[85]:
Name DateTime AnimalType SexuponOutcome AgeuponOutcome Breed Color
ID
1 Summer 2015-10-12 12:15:00 Dog Intact Female 10 months Labrador Retriever Mix Red/White
2 Cheyenne 2014-07-26 17:59:00 Dog Spayed Female 2 years German Shepherd/Siberian Husky Black/Tan
3 Gus 2016-01-13 12:20:00 Cat Neutered Male 1 year Domestic Shorthair Mix Brown Tabby
4 Pongo 2013-12-28 18:12:00 Dog Intact Male 4 months Collie Smooth Mix Tricolor
5 Skooter 2015-09-24 17:59:00 Dog Neutered Male 2 years Miniature Poodle Mix White

Train'n'Test proportion


In [86]:
sns.barplot(x=['train', 'test'], y=[len(train), len(test)], palette="BuGn_d")


Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d1ba12b0>

Target distribution


In [87]:
outcometype_dist = train['OutcomeType'].value_counts(normalize=True)
sns.barplot(x=outcometype_dist.index, y=outcometype_dist.values, palette="BuGn_d")


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d1ba9f28>

In [107]:
def factor_plot(data, x, hue):
    hue_dist = data[hue].value_counts()
    hue_frac_col = '{}_fraction'.format(hue)
    train[hue_frac_col] = train[hue].map(lambda v: 1/hue_dist[v])
    sns.factorplot(x=x, y=hue_frac_col, hue=hue, data=train, estimator=sum, kind='bar')
    data = data.drop(hue_frac_col, axis=1)

OutcomeType and AnimalType correlation

AnimalType distribution


In [89]:
animaltype_dist = train['AnimalType'].value_counts(normalize=True)
sns.barplot(x=animaltype_dist.index, y=animaltype_dist.values, palette='BuGn_d')


Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d19762e8>

In [108]:
factor_plot(train, 'OutcomeType', 'AnimalType')



In [311]:
train['SexuponOutcome'] = train['SexuponOutcome'].fillna('Unknown')
test['SexuponOutcome'] = test['SexuponOutcome'].fillna('Unknown')
def extract_sex(sex):
    if 'Female' in sex:
        return 'Female'
    if 'Male' in sex:
        return 'Male'
    return sex

train['Sex'] = train['SexuponOutcome'].map(extract_sex)
test['Sex'] = test['SexuponOutcome'].map(extract_sex)

In [113]:
animaltype_dist = train['Sex'].value_counts(normalize=True)
sns.barplot(x=animaltype_dist.index, y=animaltype_dist.values, palette='BuGn_d')


Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d14b6588>

In [114]:
factor_plot(train, 'OutcomeType', 'Sex')



In [303]:
def extract_intact(sex):
    if 'Intact' in sex:
        return 'Intact'
    if 'Spayed' in sex or 'Neutered' in sex:
        return 'Spayed'
    return sex

train['Intact'] = train['SexuponOutcome'].map(extract_intact)
test['Intact'] = test['SexuponOutcome'].map(extract_intact)

In [116]:
animaltype_dist = train['Intact'].value_counts(normalize=True)
sns.barplot(x=animaltype_dist.index, y=animaltype_dist.values, palette='BuGn_d')


Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d155d6a0>

In [117]:
factor_plot(train, 'OutcomeType', 'Intact')



In [304]:
import re

def extract_age(age):
    if pd.isnull(age):
        return np.nan
    days_in = {
        'day': 1,
        'week': 7,
        'month': 30,
        'year': 365,
    }
    
    m = re.match('(?P<num>\d+)\s+(?P<period>\w+)', age)
    num = int(m.group('num'))
    period = m.group('period')
    if period.endswith('s'):
        period = period[:-1]
    return num * days_in[period]

train['Age'] = train['AgeuponOutcome'].map(extract_age)
test['Age'] = test['AgeuponOutcome'].map(extract_age)

In [137]:
sns.distplot(train['Age'].dropna(), kde=False, norm_hist=True)


Out[137]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d12b8ef0>

In [141]:
plt.figure(figsize=(10,5))
sns.violinplot(x='OutcomeType', y='Age', data=train)


Out[141]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d1a8c898>

In [582]:
for dataset in (train, test):
    dataset['Year'] = dataset['DateTime'].map(lambda dt: dt.year)
    dataset['Quarter'] = dataset['DateTime'].map(lambda dt: dt.quarter)
    dataset['Month'] = dataset['DateTime'].map(lambda dt: dt.month)
    dataset['Day'] = dataset['DateTime'].map(lambda dt: dt.day)
    dataset['DayOfWeek'] = dataset['DateTime'].map(lambda dt: dt.dayofweek)
    dataset['Hour'] = dataset['DateTime'].map(lambda dt: dt.hour)
    dataset['Minute'] = dataset['DateTime'].map(lambda dt: dt.minute)

In [149]:
year_dist = train['Year'].value_counts(normalize=True)
sns.barplot(x=year_dist.index, y=year_dist.values, palette='BuGn_d')


Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0cfd9f5f8>

In [150]:
factor_plot(train, 'OutcomeType', 'Year')



In [152]:
quarter_dist = train['Quarter'].value_counts(normalize=True)
sns.barplot(x=quarter_dist.index, y=quarter_dist.values, palette='BuGn_d')


Out[152]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d10e4630>

In [153]:
factor_plot(train, 'OutcomeType', 'Quarter')



In [155]:
month_dist = train['Month'].value_counts(normalize=True)
sns.barplot(x=month_dist.index, y=month_dist.values, palette='BuGn_d')


Out[155]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d029b860>

In [157]:
sns.violinplot(x='OutcomeType', y='Month', data=train)


Out[157]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0d01a25f8>

In [160]:
dayofweek_dist = train['DayOfWeek'].value_counts(normalize=True)
sns.barplot(x=dayofweek_dist.index, y=dayofweek_dist.values, palette='BuGn_d')


Out[160]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0634e1c18>

In [161]:
sns.violinplot(x='OutcomeType', y='DayOfWeek', data=train)


Out[161]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0636ecf98>

In [162]:
hour_dist = train['Hour'].value_counts(normalize=True)
sns.barplot(x=hour_dist.index, y=hour_dist.values, palette='BuGn_d')


Out[162]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0635796a0>

In [163]:
sns.violinplot(x='OutcomeType', y='Hour', data=train)


Out[163]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb063b2a438>

In [583]:
sns.violinplot(x='OutcomeType', y='Minute', data=train)


Out[583]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0649a3ef0>

In [316]:
for dataset in (train, test):
    dataset['BreedMix'] = dataset['Breed'].map(lambda b: 'Mix' in b)
    dataset['Longhair'] = dataset['Breed'].map(lambda b: 'Longhair' in b)
    dataset['Shorthair'] = dataset['Breed'].map(lambda b: 'Shorthair' in b)

In [166]:
breed_mix_dist = train['BreedMix'].value_counts(normalize=True)
sns.barplot(x=breed_mix_dist.index, y=breed_mix_dist.values, palette='BuGn_d')


Out[166]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0636a7f28>

In [167]:
factor_plot(train, 'OutcomeType', 'BreedMix')



In [169]:
longhair_mix_dist = train['Longhair'].value_counts(normalize=True)
sns.barplot(x=longhair_mix_dist.index, y=longhair_mix_dist.values, palette='BuGn_d')


Out[169]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb063ff7080>

In [170]:
factor_plot(train, 'OutcomeType', 'Longhair')



In [218]:
factor_plot(train, 'OutcomeType', 'Shorthair')



In [306]:
for dataset in (train, test):
    daysofweek = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    dataset[daysofweek] = pd.get_dummies(dataset['DayOfWeek'].map(lambda d: daysofweek[d]), columns=daysofweek)[daysofweek]

In [307]:
for dataset in (train, test):
    dataset['HasName'] = dataset['Name'].isnull().map(lambda t: not t)

In [190]:
factor_plot(train, 'OutcomeType', 'HasName')



In [308]:
breeds_dist = train.append(test)['Breed'].value_counts(normalize=True)
for dataset in (train, test):
    dataset['BreedPopularity'] = dataset['Breed'].map(lambda b: breeds_dist[b])

In [272]:
sns.violinplot(x='OutcomeType', y='BreedPopularity', data=train)


Out[272]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb0658aa748>

In [309]:
colors_dist = train.append(test)['Color'].value_counts(normalize=True)
for dataset in (train, test):
    dataset['ColorPopularity'] = dataset['Color'].map(lambda c: colors_dist[c])

In [274]:
sns.violinplot(x='OutcomeType', y='ColorPopularity', data=train)


Out[274]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb065a92ac8>

In [535]:
name_dist = train.append(test)['Name'].value_counts(normalize=True)
for dataset in (train, test):
    dataset['NamePopularity'] = dataset['Name'].map(lambda n: np.nan if pd.isnull(n) else name_dist[n])

In [536]:
sns.violinplot(x='OutcomeType', y='NamePopularity', data=train)


Out[536]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb063ad6278>

In [540]:
for dataset in (train, test):
    dataset['NameLength'] = dataset['Name'].map(lambda n: 0 if pd.isnull(n) else len(n))

In [541]:
sns.violinplot(x='OutcomeType', y='NameLength', data=train)


Out[541]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb063dab1d0>

In [552]:
for dataset in (train, test):
    dataset['SimpleColor'] = dataset['Color'].map(lambda c: not '/' in c).astype(int)

In [553]:
factor_plot(train, "OutcomeType", "SimpleColor")



In [517]:
from sklearn.preprocessing import LabelEncoder

sex_encoder = LabelEncoder().fit(train['Sex'])
intact_encoder = LabelEncoder().fit(train['Intact'])
animaltype_encoder = LabelEncoder().fit(train['AnimalType'])
outcometype_encoder = LabelEncoder().fit(train['OutcomeType'])
outcomesubtype_encoder = LabelEncoder().fit(train['OutcomeSubtype'])
breed_encoder = LabelEncoder().fit(train.append(test)['Breed'])
color_encoder = LabelEncoder().fit(train.append(test)['Color'])
age_median = train['Age'].median()

train['OutcomeTypeEncoded'] = outcometype_encoder.transform(train['OutcomeType'])
train['OutcomeSubtypeEncoded'] = outcomesubtype_encoder.transform(train['OutcomeSubtype'])

for dataset in (train, test):
    dataset['SexEncoded'] = sex_encoder.transform(dataset['Sex'])
    dataset['IntactEncoded'] = intact_encoder.transform(dataset['Intact'])
    dataset['AnimalTypeEncoded'] = animaltype_encoder.transform(dataset['AnimalType'])
    dataset['BreedEncoded'] = breed_encoder.transform(dataset['Breed'])
    dataset['ColorEncoded'] = color_encoder.transform(dataset['Color'])
    dataset['AgeFilled'] = dataset['Age'].fillna(age_median)

outcomesubtype_columns = ['OutcomeSubtype_{}'.format(subtype) for subtype in outcomesubtype_encoder.classes_]
for column_name, subtype in zip(outcomesubtype_columns, outcomesubtype_encoder.classes_):
    train[column_name] = (train['OutcomeSubtype'] == subtype).astype(int)

In [584]:
features = [
    'AgeFilled',
    'AnimalTypeEncoded',
    'SexEncoded',
    'IntactEncoded',
    'HasName',
    'Year',
    'Month',
    'Quarter',
    'Hour',
    'Minute',
    'BreedMix',
#     'BreedPopularity',
#     'Longhair',
#     'Shorthair',
#     'ColorPopularity',
    *daysofweek,
#     *outcomesubtype_columns,
#     'OutcomeSubtypeEncoded',
    'BreedEncoded',
#     'ColorEncoded',
#     'NamePopularity',
#     'NameLength',
#     'SimpleColor',
]

target = 'OutcomeTypeEncoded'

X = train[features]
y = train[target]

In [559]:
params = {
    'n_estimators': 100,
    'max_depth': 9,
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'seed': 42,
}

In [511]:
from sklearn.grid_search import RandomizedSearchCV, GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'n_estimators': np.arange(50, 120, 10),
    'max_depth': np.arange(4, 12),
    'subsample': np.linspace(0.7, 1.0, 10),
    'colsample_bytree': np.linspace(0.7, 1.0, 10),
}

grid_search = RandomizedSearchCV(XGBClassifier(**params), param_grid, cv=3, scoring='log_loss', verbose=True, n_iter=100)

%time grid_search.fit(X, y).best_score_


Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:  1.5min
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:  5.6min
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  8.6min finished
CPU times: user 43min 43s, sys: 3min 40s, total: 47min 24s
Wall time: 8min 38s
Out[511]:
-0.7466863879594825

In [572]:
grid_search.best_params_


Out[572]:
{'colsample_bytree': 0.79999999999999993,
 'max_depth': 8,
 'n_estimators': 110,
 'subsample': 0.8666666666666667}

In [585]:
from xgboost import plot_importance

params1 = params.copy()
params1.update(grid_search.best_params_)
xgb = XGBClassifier(**params1).fit(X, y)
plot_importance(xgb)


Out[585]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb064a53710>

In [604]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [587]:
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize

params1 = params.copy()
params1.update(grid_search.best_params_)
scores1 = cross_val_score(XGBClassifier(**params1), X, y, cv=5, scoring='log_loss')
print("mean_score: {:0.6f}, score_std: {:0.6f}".format(scores1.mean(), scores1.std()))


mean_score: -0.733239, score_std: 0.006916

In [588]:
params2 = params1.copy()
params2['colsample_bytree'] = 0.65
scores2 = cross_val_score(XGBClassifier(**params2), X, y, cv=5, scoring='log_loss')
print("mean_score: {:0.6f}, score_std: {:0.6f}".format(scores2.mean(), scores2.std()))


mean_score: -0.730661, score_std: 0.006401

In [589]:
params3 = params.copy()
params3['colsample_bytree'] = 0.75
scores3 = cross_val_score(XGBClassifier(**params3), X, y, cv=5, scoring='log_loss')
print("mean_score: {:0.6f}, score_std: {:0.6f}".format(scores3.mean(), scores3.std()))


mean_score: -0.731780, score_std: 0.005564

In [647]:
y_pred1 = XGBClassifier(**params1).fit(X_train, y_train).predict_proba(X_test)
y_pred2 = XGBClassifier(**params2).fit(X_train, y_train).predict_proba(X_test)
y_pred3 = XGBClassifier(**params3).fit(X_train, y_train).predict_proba(X_test)

In [648]:
log_loss(y_test, y_pred1)


Out[648]:
0.73180711390645714

In [649]:
log_loss(y_test, y_pred2)


Out[649]:
0.72945161018616578

In [650]:
log_loss(y_test, y_pred3)


Out[650]:
0.73073642536500605

In [651]:
log_loss(y_test, (y_pred1 + y_pred2 + y_pred3)/3)


Out[651]:
0.72741441605993284

In [652]:
log_loss(y_test, np.power(y_pred1 * y_pred2 * y_pred3, 1/3))


Out[652]:
0.72758277170479424

In [653]:
log_loss(y_test, 3/(1/y_pred1 + 1/y_pred2 + 1/y_pred3))


Out[653]:
0.72779619416924557

In [654]:
from sklearn.calibration import CalibratedClassifierCV

xgb1_calibrated = CalibratedClassifierCV(XGBClassifier(**params1), cv=10, method='isotonic').fit(X_train, y_train)
y_pred_calib1 = xgb1_calibrated.predict_proba(X_test)
log_loss(y_test, y_pred_calib1)


Out[654]:
0.72813229276146441

In [655]:
xgb2_calibrated = CalibratedClassifierCV(XGBClassifier(**params2), cv=10, method='isotonic').fit(X_train, y_train)
y_pred_calib2 = xgb2_calibrated.predict_proba(X_test)
log_loss(y_test, y_pred_calib2)


Out[655]:
0.73558475095768894

In [656]:
xgb3_calibrated = CalibratedClassifierCV(XGBClassifier(**params3), cv=10, method='isotonic').fit(X_train, y_train)
y_pred_calib3 = xgb3_calibrated.predict_proba(X_test)
log_loss(y_test, y_pred_calib3)


Out[656]:
0.73275698976444292

In [657]:
log_loss(y_test, (y_pred1 + y_pred2 + y_pred3 + y_pred_calib1 + y_pred_calib2 + y_pred_calib3)/6)


Out[657]:
0.72475547909189875

In [658]:
log_loss(y_test, (y_pred_calib1 + y_pred_calib2 + y_pred_calib3)/3)


Out[658]:
0.7244142160340894

In [659]:
log_loss(y_test, np.power(y_pred1 * y_pred2 * y_pred3, 1/3))


Out[659]:
0.72758277170479424

In [698]:
from sklearn.ensemble import RandomForestClassifier

rf_estimator = RandomForestClassifier(n_estimators=90, max_depth=14).fit(X_train, y_train)
rf_pred = rf_estimator.predict_proba(X_test)
log_loss(y_test, rf_pred)


Out[698]:
0.74982734867920586

In [707]:
from scipy.optimize import minimize

def target_fn(x):
    return log_loss(y_test, x[0]*rf_pred + x[1]*y_pred2 + x[2]*y_pred3 + x[3]*y_pred_calib1 + x[4]*y_pred_calib2 + x[5]*y_pred_calib3)

def norm_consraint(x):
    return np.sum(x) - 1

xopt = minimize(target_fn, [1/6]*6, bounds=[[0, 1] for i in range(6)], constraints=({'type': 'eq', 'fun': norm_consraint}), tol=1e-14)
xopt


Out[707]:
     fun: 0.72153582590073995
     jac: array([ -9.05916095e-05,   6.17079437e-04,   5.51802665e-03,
         2.24620849e-03,   7.24270940e-05,  -9.06065106e-05,
         0.00000000e+00])
 message: 'Optimization terminated successfully.'
    nfev: 305
     nit: 35
    njev: 35
  status: 0
 success: True
       x: array([  1.31537379e-01,   0.00000000e+00,   9.97620177e-02,
         3.75720735e-18,   3.21269827e-17,   7.68700603e-01])

In [682]:
alpha = xopt.x

In [669]:
xgb1 = XGBClassifier(**params1).fit(X, y)
xgb2 = XGBClassifier(**params2).fit(X, y)
xgb3 = XGBClassifier(**params3).fit(X, y)

xgb1_calibrated.fit(X, y)
xgb2_calibrated.fit(X, y)
xgb3_calibrated.fit(X, y)


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-669-9520474750ee> in <module>()
      3 xgb3 = XGBClassifier(**params3).fit(X, y)
      4 
----> 5 xgb1_calibrated.fit(X, y)
      6 xgb2_calibrated.fit(X, y)
      7 xgb3_calibrated.fit(X, y)

/opt/conda/lib/python3.5/site-packages/sklearn/calibration.py in fit(self, X, y, sample_weight)
    172                         sample_weight=base_estimator_sample_weight[train])
    173                 else:
--> 174                     this_estimator.fit(X[train], y[train])
    175 
    176                 calibrated_classifier = _CalibratedClassifier(

/opt/conda/lib/python3.5/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
    341                               early_stopping_rounds=early_stopping_rounds,
    342                               evals_result=evals_result, feval=feval,
--> 343                               verbose_eval=verbose)
    344 
    345         if evals_result:

/opt/conda/lib/python3.5/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model)
    119     if not early_stopping_rounds:
    120         for i in range(num_boost_round):
--> 121             bst.update(dtrain, i, obj)
    122             nboost += 1
    123             if len(evals) != 0:

/opt/conda/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
    692 
    693         if fobj is None:
--> 694             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
    695         else:
    696             pred = self.predict(dtrain)

KeyboardInterrupt: 

In [641]:
y_pred1 = xgb1.predict_proba(test[features])
y_pred2 = xgb2.predict_proba(test[features])
y_pred3 = xgb3.predict_proba(test[features])

y_pred_calibrated1 = xgb1_calibrated.predict_proba(test[features])
y_pred_calibrated2 = xgb2_calibrated.predict_proba(test[features])
y_pred_calibrated3 = xgb3_calibrated.predict_proba(test[features])
y_pred = (alpha[0]*y_pred1 + alpha[1]*y_pred2 + alpha[2]*y_pred3 + alpha[3]*y_pred_calibrated1 + alpha[4]*y_pred_calibrated2 + alpha[5]*y_pred_calibrated3)
# y_pred = y_pred2

In [642]:
submission = pd.DataFrame(index=test.index)
for i, outcome_type in enumerate(outcometype_encoder.classes_):
    submission[outcome_type] = y_pred[:, i]

In [643]:
submission.sum().plot(kind='bar')


Out[643]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb064cf07b8>

In [644]:
submission.to_csv('pred.csv')

In [ ]: